Tried it, but not as good as expected.

#11
by kk3dmax - opened

I have tried it; maybe due to its 8B size, it could not follow my very complex instructions and failed to output a valid JSON-formatted answer.
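For what it's worth, this failure mode is easy to check for programmatically. Here is a minimal sketch in Python; the `reply` string is a hypothetical stand-in for the model's raw output, not an actual response:

```python
import json

def parse_model_json(reply: str):
    """Return the parsed object, or None if the reply is not valid JSON."""
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        return None

# Hypothetical failure: the model wraps its JSON in commentary.
reply = 'Sure! Here is your answer: {"status": "ok"}'
print(parse_model_json(reply))  # None, because the reply is not bare JSON
```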

Yeah, back when HF had a leaderboard, these distilled versions completely bombed on tests like IFEval. Not only that, they lost tons of factual knowledge as well as abilities like poem writing. I just don't see the point. They all perform horrendously at everything but logic, math, and coding, and even then the improvements were minor and unreliable.

Distillation can work. For example, Llama 3.2 3b performs nearly as well as Llama 3.1 8b. But if DeepSeek wants to create a smaller distilled version, they have to make their own smaller model, as Meta, Google, OpenAI, and others have done.

Honestly, this model isn't even good at the logic/math/coding aspect. It sacrificed everything for the sake of yapping more. I spent about 6 hours total trying to get usable outputs from it. Here is a little write-up I did in the official EXL2/3 Discord server:

"Alright, I have spent hours trying to get this R1 8b distill to be usable. I have matched their system prompt, samplers, everything. This model is a mess

It constantly gaslights itself, misses blatantly obvious stuff, makes no sense, can't count, and is just overall an absolute mess.

Some examples of its huge fails:

- "The first three letters of 'HEROINE' is 'HERO'", which is clearly four letters.
- It then claimed "'her' does not signify a female".
- It solved the Heroine riddle in 1.2k tokens, then spent 11k tokens gaslighting itself that it was wrong.
- It said "C-O-L-T-R-O-L, or 'COLTROOL'", adding an O out of nowhere.
- Instead of decoding FROWURO, it decoded FRWUROR, then got hung up for over 7k tokens on how "COTROOL" is not a real word and how I must have made a typo.
This model's "vibe" seems to be getting something correct reasonably fast, then spending 5-20k tokens gaslighting itself on how what it just solved is wrong, before assuming the "user must be wrong", and then backtracking on everything, quite reliably giving the WRONG answer while also suggesting the right answer as something to be explored more. I would have to say about 70% of questions this model got wrong, it had reasoning chains considerably closer to the correct answer that it gaslit itself out of following for seemingly no reason other than "No, that doesn't make sense" when it very clearly does.

With a simple +3 cipher, I gave it 12 attempts. It passed 4 times, and it averaged over 18K tokens of reasoning... for nothing.

For a model that claims to rival Qwen3 235b, it sure can't do simple 1-2-3 counting.

I have given this model several tests that many models from the last 5 months can pass, and I would say it has about a 60% failure rate on simpler questions and about a 90% failure rate on complex questions. I find it inferior to the DeepScaleR 1.5b preview from several months ago in a lot of key ways (specifically math, where DeepScaleR 1.5b considerably outperforms it in both ability and token efficiency).

For the sake of diligence, I tried the following versions: EXL2: 6bpw, 8bpw, FP16. EXL3: 6bpw, 8bpw. GGUF: Q8, FP16. All of them were just as incompetent. This model has to be one of the worst CoT models I have tried in a long time, which is crazy, because it's based on a fairly competent base (Qwen 3 8b)."
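For anyone curious, the "+3 cipher" in the write-up is a Caesar shift. Here is a minimal sketch in Python of what such a test checks; the plaintext "CONTROL" is an assumption inferred from the quoted outputs, not the commenter's exact prompt or test harness:

```python
def caesar(text: str, shift: int) -> str:
    """Shift each letter by `shift` positions, wrapping within the alphabet."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)  # leave spaces and punctuation untouched
    return "".join(out)

print(caesar("CONTROL", 3))   # "FRQWURO" (encode with +3)
print(caesar("FRQWURO", -3))  # "CONTROL" (decode with the inverse shift)
```

Decoding is just the inverse shift, which makes a 4/12 pass rate at 18K+ tokens per attempt all the more damning.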

@SytanSD I had similar issues with the source Qwen3 8b model. It failed to answer simple questions that much smaller models like Llama 3.2 3b reliably got right, such as what's the third rock from the sun (Earth). So I suspect the primary issue is that DeepSeek used Qwen3, which is so egregiously overfit to the standard LLM benchmarks that it's riddled with pockets of profound ignorance, making it frustratingly unreliable across a spectrum of real-world tasks.
